COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations
Audio representation learning based on deep neural networks (DNNs) has emerged
as an alternative to hand-crafted features. To achieve high performance, DNNs
often need large amounts of annotated data, which can be difficult and costly
to obtain. In this paper, we propose a method for learning
audio representations, aligning the learned latent representations of audio and
associated tags. Aligning is done by maximizing the agreement of the latent
representations of audio and tags, using a contrastive loss. The result is an
audio embedding model which reflects acoustic and semantic characteristics of
sounds. We evaluate the quality of our embedding model, measuring its
performance as a feature extractor on three different tasks (namely, sound
event recognition, and music genre and musical instrument classification), and
investigate what type of characteristics the model captures. Our results are
promising, sometimes on par with the state of the art in the considered tasks,
and the embeddings produced with our method correlate well with several
acoustic descriptors.
Comment: 8 pages, 1 figure, workshop on Self-supervision in Audio and Speech
at the 37th International Conference on Machine Learning (ICML), 2020,
Vienna, Austria
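The batch-wise alignment described above can be sketched in NumPy. This is an illustrative NT-Xent-style formulation, not the authors' exact implementation: each audio embedding's positive is the tag embedding of the same clip (the diagonal of the similarity matrix), contrasted against all other pairs in the batch.

```python
import numpy as np

def contrastive_alignment_loss(audio_emb, tag_emb, temperature=0.1):
    """Maximize agreement between each audio embedding and the tag
    embedding of the same clip, contrasted against all other pairs."""
    # L2-normalize both modalities so the dot product is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = tag_emb / np.linalg.norm(tag_emb, axis=1, keepdims=True)
    logits = (a @ t.T) / temperature             # (B, B) similarity matrix
    # Row-wise log-softmax; positives sit on the diagonal.
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

When audio and tag embeddings of the same clip agree, the diagonal dominates each row and the loss approaches zero; mismatched pairs drive it up.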
FSD50K: an Open Dataset of Human-Labeled Sound Events
Most existing datasets for sound event recognition (SER) are relatively small
and/or domain-specific, with the exception of AudioSet, based on a massive
amount of audio tracks from YouTube videos and encompassing over 500 classes of
everyday sounds. However, AudioSet is not an open dataset---its release
consists of pre-computed audio features (instead of waveforms), which limits
the adoption of some SER methods. Downloading the original audio tracks is also
problematic, as the constituent YouTube videos gradually disappear and carry
usage-rights issues, which casts doubt on the suitability of this resource
for benchmarking systems. To provide an alternative benchmark dataset and thus
foster SER research, we introduce FSD50K, an open dataset containing over 51k
audio clips totalling over 100h of audio manually labeled using 200 classes
drawn from the AudioSet Ontology. The audio clips are licensed under Creative
Commons licenses, making the dataset freely distributable (including
waveforms). We provide a detailed description of the FSD50K creation process,
tailored to the particularities of Freesound data, including challenges
encountered and solutions adopted. We include a comprehensive dataset
characterization along with discussion of limitations and key factors to allow
its audio-informed usage. Finally, we conduct sound event classification
experiments to provide baseline systems as well as insight on the main factors
to consider when splitting Freesound audio data for SER. Our goal is to develop
a dataset to be widely adopted by the community as a new open benchmark for SER
research.
Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags
Self-supervised audio representation learning offers an attractive alternative for obtaining generic audio embeddings that can be employed in various downstream tasks. Published approaches that consider both audio and the words/tags associated with it do not employ text processing models capable of generalizing to tags unknown during training. In this work we propose a method for learning audio representations using an audio autoencoder (AAE), a general word embeddings model (WEM), and a multi-head self-attention (MHA) mechanism. MHA attends over the output of the WEM, providing a contextualized representation of the tags associated with the audio, and we align the output of MHA with the output of the encoder of the AAE using a contrastive loss. We jointly optimize the AAE and MHA, and we evaluate the audio representations (i.e. the output of the encoder of the AAE) by utilizing them in three different downstream tasks, namely sound, music genre, and musical instrument classification. Our results show that employing multi-head self-attention with multiple heads in the tag-based network can induce better learned audio representations.
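A minimal NumPy sketch of the MHA step over a set of tag word embeddings. The projection matrices and head count are illustrative; the abstract does not specify the exact parameterization. Note that, without positional encodings, the contextualized output is permutation-equivariant over the tag set, which suits unordered tags.

```python
import numpy as np

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Contextualize tag embeddings X of shape (n_tags, d) with scaled
    dot-product self-attention split across n_heads heads."""
    n, d = X.shape
    dh = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Split the model dimension into heads: (n, d) -> (h, n, dh).
    split = lambda M: M.reshape(n, n_heads, dh).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(dh)  # (h, n, n)
    scores -= scores.max(axis=-1, keepdims=True)       # stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    Z = (attn @ Vh).transpose(1, 0, 2).reshape(n, d)   # concatenate heads
    return Z @ Wo
```

Mean-pooling the contextualized rows gives a single tag-side vector that can be aligned with the AAE encoder output via the contrastive loss.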
Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning
Pre-Training Strategies Using Contrastive Learning and Playlist Information for Music Classification and Similarity
In this work, we investigate an approach that relies on contrastive learning
and music metadata as a weak source of supervision to train music
representation models. Recent studies show that contrastive learning can be
used with editorial metadata (e.g., artist or album name) to learn audio
representations that are useful for different classification tasks. In this
paper, we extend this idea to using playlist data as a source of music
similarity information and investigate three approaches to generate anchor and
positive track pairs. We evaluate these approaches by fine-tuning the
pre-trained models for music multi-label classification tasks (genre, mood, and
instrument tagging) and music similarity. We find that creating anchor and
positive track pairs by relying on co-occurrences in playlists provides better
music similarity and competitive classification results compared to choosing
tracks from the same artist as in previous works. Additionally, our best
pre-training approach based on playlists provides superior classification
performance for most datasets.
Comment: Accepted at the 2023 International Conference on Acoustics, Speech,
and Signal Processing (ICASSP'23).
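The co-occurrence-based pairing described above can be sketched as follows. This is a simplified illustration, assuming playlists are given as lists of track identifiers; the paper investigates three pairing strategies, of which this shows only the basic same-playlist sampling.

```python
import random

def playlist_cooccurrence_pairs(playlists, n_pairs, seed=0):
    """Sample (anchor, positive) pairs: two distinct tracks drawn from the
    same playlist, so co-occurrence acts as weak similarity supervision."""
    rng = random.Random(seed)
    # Only playlists with at least two tracks can yield a pair.
    eligible = [p for p in playlists if len(p) >= 2]
    pairs = []
    for _ in range(n_pairs):
        playlist = rng.choice(eligible)
        anchor, positive = rng.sample(playlist, 2)
        pairs.append((anchor, positive))
    return pairs
```

The resulting pairs feed a contrastive objective in which other tracks in the batch serve as negatives, analogously to pairing by artist in earlier work.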
The Role of Glottal Source Parameters for High-Quality Transformation of Perceptual Age
The intuitive control of voice transformation (e.g., age/sex, emotions) is useful for extending the expressive repertoire of a voice. This paper explores the role of glottal source parameters in the control of voice transformation. First, the SVLN speech synthesizer (Separation of the Vocal-tract with the Liljencrants-Fant model plus Noise) is used to represent the glottal source parameters (and thus, voice quality) during speech analysis and synthesis. Then, a simple statistical method is presented to control speech parameters during voice transformation: a GMM is used to model the speech parameters of a voice, and regressions are then used to adapt the GMM's statistics (mean and variance) to a control parameter (e.g., age/sex, emotions). A subjective experiment on the control of perceptual age demonstrates the importance of the glottal source parameters for the control of voice transformation, and shows the efficiency of the statistical model in controlling voice parameters while preserving high quality in the voice transformation.
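The regression step can be illustrated with a deliberately simplified sketch: a least-squares fit of each speech parameter against the scalar control value (here, perceived age), and a shift of parameters along the fitted direction. The full method adapts GMM means and variances per component; this single-regression version is only an assumption-laden approximation of that idea.

```python
import numpy as np

def fit_control_regression(params, control):
    """Least-squares regression of each speech parameter on a scalar
    control value (e.g. perceived age). Returns coef of shape
    (2, n_params): row 0 holds slopes, row 1 holds intercepts."""
    X = np.column_stack([control, np.ones_like(control)])
    coef, *_ = np.linalg.lstsq(X, params, rcond=None)
    return coef

def transform_params(params, coef, delta_control):
    """Shift parameters along the fitted regression direction so the
    implied control value moves by delta_control."""
    return params + delta_control * coef[0]
```

In the paper's setting the transformed parameters would then drive the SVLN synthesizer to produce the age-modified voice.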
Tools and Applications for Interactive-Algorithmic Control of Sound Spatialization in OpenMusic
We present recent works carried out in the OpenMusic computer-aided composition environment for combining compositional processes with spatial audio rendering. We consider new modalities for manipulating sound spatialization data and processes, following both object-based and channel-based approaches, and we have developed a framework linking algorithmic processing with interactive control.
Improving sound retrieval in large collaborative collections
Capturing sounds on a recording medium to enable their preservation and reproduction became possible during the industrial revolution of the 19th century, originally achieved through mechanical and acoustic devices, and later electronic and magnetic ones. Eventually, the digital age of the mid-20th century brought about the democratization of recording and reproduction devices, as well as accessible ways of storing and sharing content. As a consequence, massive collections of audio samples are nowadays increasingly available online, some of which are created collaboratively thanks to sharing platforms. This content has become essential for entertainment media, such as movies, music, and video games, and for human-machine interaction. Nonetheless, given the amount and diversity of the content, exploring, searching, and retrieving from collaborative collections becomes increasingly challenging. Methods for automatically organizing content and facilitating its retrieval therefore become more and more necessary, creating an opportunity for novel Information Retrieval approaches.
This thesis aims at improving the retrieval of sounds in large collaborative collections, and does so from different perspectives. We first investigate data collection methodologies for creating large and sustainable audio datasets, including the design and development of a website and an annotation tool to engage users in the collaborative process of dataset creation. Additionally, we focus on improving the manual annotation of audio samples when using large taxonomies. This calls for specialized tools to assist users in providing exhaustive and consistent annotations. This work produced a number of publicly available large-scale datasets for developing and evaluating machine listening models. From another perspective, we propose novel methods for learning audio representations, suitable for diverse machine learning applications, by taking advantage of large amounts of online content and its metadata. We then investigate the problem of unsupervised classification by first identifying which types of audio features are suited for clustering the wide variety of sounds present in online collections. Finally, we focus on Search Results Clustering, an approach that organizes search results into coherent groups. This research improved the retrieval of sounds from large collections, namely by facilitating exploration and interaction with search results.
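The Search Results Clustering idea can be illustrated with a plain k-means pass over feature vectors of the returned sounds. This is a generic sketch, not the thesis's actual clustering pipeline or feature set.

```python
import numpy as np

def cluster_search_results(features, k, n_iter=50, seed=0):
    """Group search-result feature vectors (n_results, n_dims) into k
    coherent clusters with plain k-means; returns labels and centroids."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct result vectors.
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each result to its nearest centroid.
        dists = np.linalg.norm(features[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned results.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = features[labels == j].mean(axis=0)
    return labels, centroids
```

Each cluster can then be presented to the user as one coherent group of results, easing exploration of large collections.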
Multi web audio sequencer: collaborative music making
Paper presented at the Web Audio Conference WAC-2018, held 19-21 September 2018 in Berlin, Germany.
Recent advancements in web-based audio systems have enabled sufficiently
accurate timing control and real-time sound processing capabilities. Numerous
specialized music tools, as well as digital audio workstations, are now
accessible from browsers. Features such as the broad accessibility of data and
real-time communication between clients make the web attractive for
collaborative data manipulation. However, this innovative field has yet to
produce effective tools for multiple-user coordination on specialized music
creation tasks. The Multi Web Audio Sequencer is a prototype of an application
for segment-based sequencing of Freesound sound clips, with an emphasis on
seamless remote collaboration. In this work we consider a fixed-grid step
sequencer as a probe for understanding the necessary features of crowd-shared
music creation sessions. This manuscript describes the sequencer and the
functionalities and types of interactions required for effective and
attractive collaboration of remote people during creative music creation
activities.
This work has received funding from the European Union's Horizon 2020 research
and innovation programme under grant agreement No 688382 "AudioCommons".